chore: event count throttle for squashed commands #4924
Signed-off-by: kostas <[email protected]>
thread_local size_t MultiCommandSquasher::throttle_size_limit_ =
    absl::GetFlag(FLAGS_throttle_squashed);

thread_local util::fb2::EventCount MultiCommandSquasher::ec_;

MultiCommandSquasher::MultiCommandSquasher(absl::Span<StoredCmd> cmds, ConnectionContext* cntx,
This is used not only from async fiber but also directly from the connection. If we preempt, the connection will also "freeze". I guess this is fine, just mentioning it here for completeness.
There are 3 calls of this and all of them should be ok if we preempt from these flows.
tests/dragonfly/memory_test.py
Outdated
await cl.execute_command("exec")

# With the current approach this will overshoot
# await client.execute_command("multi")
I wish we handled this case as well.
What is the difference? Why does it not handle this case?
src/server/multi_command_squasher.h
Outdated
@@ -94,6 +104,9 @@ class MultiCommandSquasher {

  // we increase size in one thread and decrease in another
  static atomic_uint64_t current_reply_size_;
  static thread_local size_t throttle_size_limit_;
  // Used to throttle when memory is tight
  static thread_local util::fb2::EventCount ec_;
We need this to avoid calling ThisFiber::Yield or ThisFiber::SleepFor in a while(true) loop.
since it's thread local, it's more efficient to use NoOpLock together with CondVarAny
@romange nice!
Actually, I added a bug here:
static atomic_uint64_t current_reply_size_;
so current_reply_size_ is not thread local. So what can happen is:
Core 0 -> starts multi/exec
Core 1 -> starts multi/exec but needs to throttle, so it goes to sleep waiting on its thread-local condition variable
Core 0 -> is done and notifies its own thread-local condition variable
Core 1 -> the fiber never wakes up, even though we decremented current_reply_size_
Since current_reply_size_ is global, ec_ should be global too.
P.S. I am not very happy with this extra synchronization, but we only pay for it when we are under memory pressure.
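The scenario above can be replayed outside Dragonfly with standard-library primitives. The sketch below is an analogy, not Dragonfly code: it assumes `std::mutex` and `std::condition_variable` in place of `util::fb2::EventCount`, and the names `ReplyThrottle`, `Add`, `Release`, `WaitUntilWithinLimit` and `RunScenario` are hypothetical. It illustrates the point made here: when the counter is shared by all threads, the waiter has to block on a primitive that the releasing thread also notifies.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the squasher's throttle; not Dragonfly's API.
class ReplyThrottle {
 public:
  explicit ReplyThrottle(uint64_t limit) : limit_(limit) {}

  // Accounts for newly buffered reply bytes.
  void Add(uint64_t bytes) {
    std::lock_guard<std::mutex> lk(mu_);
    size_ += bytes;
  }

  // Releases bytes and wakes all waiters, whichever thread they run on.
  void Release(uint64_t bytes) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      assert(size_ >= bytes);
      size_ -= bytes;
    }
    cv_.notify_all();
  }

  // Blocks the caller while the shared counter exceeds the limit.
  void WaitUntilWithinLimit() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [this] { return size_ <= limit_; });
  }

 private:
  const uint64_t limit_;
  uint64_t size_ = 0;           // shared by all threads, guarded by mu_
  std::mutex mu_;
  std::condition_variable cv_;  // shared too, so Release() reaches every waiter
};

// Replays the "Core 0 / Core 1" scenario; returns true if the waiter woke up.
bool RunScenario() {
  ReplyThrottle throttle(/*limit=*/100);
  throttle.Add(150);  // "Core 0" has buffered replies above the limit.

  std::atomic<bool> woke{false};
  std::thread core1([&] {
    throttle.WaitUntilWithinLimit();  // "Core 1" has to throttle.
    woke.store(true);
  });

  throttle.Release(150);  // "Core 0" is done and notifies the shared condvar.
  core1.join();
  return woke.load();
}
```

If `cv_` were per thread instead, the `Release` call on "Core 0" would notify a condition variable nobody waits on, which is the stuck-fiber case described above.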
it should be thread local.
@adiholden pinging for an early discussion here
src/server/multi_command_squasher.cc
Outdated
@@ -15,6 +16,8 @@
#include "server/transaction.h"
#include "server/tx_base.h"

ABSL_FLAG(size_t, throttle_squashed, 0, "");
@adiholden I will adjust as we said f2f. Looking for some early feedback based on our discussion
src/server/multi_command_squasher.cc
Outdated
@@ -63,6 +66,10 @@ size_t Size(const facade::CapturingReplyBuilder::Payload& payload) {
}  // namespace

atomic_uint64_t MultiCommandSquasher::current_reply_size_ = 0;
thread_local size_t MultiCommandSquasher::throttle_size_limit_ =
    absl::GetFlag(FLAGS_throttle_squashed);
As discussed this morning, multiply by the thread count. The limit should be per thread, and current_reply_size_ is a global counter.
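A minimal sketch of the arithmetic this comment asks for, assuming the flag holds a per-thread limit and the counter stays global. The helper names `GlobalReplyLimit` and `IsOverGlobalLimit` are hypothetical, and the 128 MiB and 8-thread values are only illustrative.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: scales a per-thread limit to a threshold for a global counter.
constexpr uint64_t GlobalReplyLimit(uint64_t per_thread_limit, size_t thread_count) {
  return per_thread_limit * thread_count;
}

// True when the global counter has crossed the scaled threshold.
constexpr bool IsOverGlobalLimit(uint64_t current_size, uint64_t per_thread_limit,
                                 size_t thread_count) {
  return current_size > GlobalReplyLimit(per_thread_limit, thread_count);
}

// Example: 128 MiB per thread on 8 threads gives a 1 GiB global threshold.
static_assert(GlobalReplyLimit(128ull << 20, 8) == (1ull << 30), "128 MiB * 8 == 1 GiB");
```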
Yes I know, I even wrote a comment above that I will follow up with this 😄
I wanted to know if you have anything else to add 😄
src/server/multi_command_squasher.cc
Outdated
@@ -63,6 +66,9 @@ size_t Size(const facade::CapturingReplyBuilder::Payload& payload) {
}  // namespace

atomic_uint64_t MultiCommandSquasher::current_reply_size_ = 0;
thread_local size_t MultiCommandSquasher::throttle_size_limit_ =
I believe we should multiply throttle_squashed by the number of IO threads, not by the number of shards.
We should not do this at all. The limit should be per thread. And in general, it's not well defined to initialize a thread local from another thread local that is initialized to nullptr.
src/server/multi_command_squasher.h
Outdated
@@ -37,6 +38,15 @@ class MultiCommandSquasher {
    return current_reply_size_.load(std::memory_order_relaxed);
  }

  static bool IsMultiCommandSquasherOverLimit() {
maybe rename to IsReplySizeOverLimit?
src/server/multi_command_squasher.h
Outdated
  // Used to throttle when memory is tight
  static util::fb2::EventCount ec_;

  static thread_local size_t throttle_size_limit_;
maybe reply_size_limit_ ?
src/server/multi_command_squasher.cc
Outdated
@@ -15,6 +16,8 @@
#include "server/transaction.h"
#include "server/tx_base.h"

ABSL_FLAG(size_t, throttle_squashed, 0, "");
Maybe squashed_reply_size_limit?
Add a flag description.
Also, I think we should have a default limit here, maybe 128_MB?
tests/dragonfly/memory_test.py
Outdated
# At any point we should not cross this limit
assert df.rss < 1_500_000_000
cl = df.client()
await cl.execute_command("multi")
I see that the flow you are testing is the multi/exec flow, which I did not think about. When I suggested this throttling, I was thinking about the pipeline flow.
Reviewing the multi/exec flow now, I am not 100% sure about applying this logic there: when the code awaits for the size to decrease, we have already scheduled the transaction, and I am not sure whether this can lead to a deadlock in some cases.
src/server/server_state.h
Outdated
@@ -270,6 +270,10 @@ class ServerState {  // public struct - to allow initialization.

  bool ShouldLogSlowCmd(unsigned latency_usec) const;

  size_t GetTotalShards() const {
Not needed - you can use shard_set->size() everywhere.
src/server/multi_command_squasher.cc
Outdated
@@ -215,6 +222,9 @@ bool MultiCommandSquasher::ExecuteSquashed(facade::RedisReplyBuilder* rb) {
  if (order_.empty())
    return true;

  MultiCommandSquasher::ec_.await(
All our current approaches to limiting memory are "per-thread"; this is consistent and works nicely with shared-nothing. What is the reason for not using per-thread limits? In addition, we already have per-thread throttling inside the dragonfly_connection code, see IsPipelineBufferOverLimit. Did you consider piggybacking on this mechanism instead?
As long as current_reply_size_ is global, I don't see how we can do this per thread.
src/server/multi_command_squasher.cc
Outdated
for (auto idx : order_) {
  auto& replies = sharded_[idx].replies;
  CHECK(!replies.empty());

  aborted |= opts_.error_abort && CapturingReplyBuilder::TryExtractError(replies.back());

  current_reply_size_.fetch_sub(Size(replies.back()), std::memory_order_relaxed);
I said nothing when current_reply_size_ was added. It was a mistake. I do not want anyone to introduce global state into the Dragonfly codebase.
@BorysTheDev FYI
But @romange, we discussed this when current_reply_size_ was added. Because the multi command squasher adds replies from different threads, you said it made sense.
Would something like this work?
https://github.com/dragonflydb/dragonfly/compare/RemoveAtomic?expand=1
@romange The change in your branch linked above counts the reply size only after all the squashed commands have executed. It does use the thread-local approach correctly, but the metric we expose is not accurate: while the capturing reply builders grow, we do not report their size until all the squashed commands have finished. I think applying this logic only delays the throttling on reply size: we will still throttle, but some time after we actually reach the threshold. I guess we can make this change.
There is no "correct" solution, because I can show you a scenario where a single central atomic won't work either:
we throttle before we send commands to shards, but there may be tons of squashed commands in flight that have not filled their replies yet, so you let the next command pass and only then does the reply buffer grow.
I would rather have a less accurate metric than have all our threads contend on atomics and now on a single condvar. This kills performance. I wouldn't be surprised if squashing performance is already worse because the "reply buffer size" atomic is being hammered by multiple threads.
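The trade-off argued for here can be sketched as follows: each thread accounts for reply bytes in its own slot and throttles on that slot alone, while metrics read an approximate sum. This is a minimal illustration with standard-library types, not the change in the linked branch; the class `PerThreadReplySize`, its methods, and the explicit thread index are all assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of per-thread reply accounting; not Dragonfly's API.
class PerThreadReplySize {
 public:
  PerThreadReplySize(size_t thread_count, uint64_t per_thread_limit)
      : slots_(thread_count), limit_(per_thread_limit) {}

  // Called only by the thread that owns slot `idx`.
  void Add(size_t idx, uint64_t bytes) {
    slots_[idx].bytes.fetch_add(bytes, std::memory_order_relaxed);
  }

  // Called only by the thread that owns slot `idx`.
  void Sub(size_t idx, uint64_t bytes) {
    assert(slots_[idx].bytes.load(std::memory_order_relaxed) >= bytes);
    slots_[idx].bytes.fetch_sub(bytes, std::memory_order_relaxed);
  }

  // Throttle decision: looks only at the caller's own slot.
  bool IsOverLimit(size_t idx) const {
    return slots_[idx].bytes.load(std::memory_order_relaxed) > limit_;
  }

  // Metrics: an approximate sum, since slots are read one by one.
  uint64_t Total() const {
    uint64_t sum = 0;
    for (const Slot& s : slots_) sum += s.bytes.load(std::memory_order_relaxed);
    return sum;
  }

 private:
  // One cache line per slot so that writers on different threads do not false-share.
  struct alignas(64) Slot {
    std::atomic<uint64_t> bytes{0};
  };
  std::vector<Slot> slots_;
  const uint64_t limit_;
};
```

Only the metrics reader touches other threads' slots, so the hot path never contends on a shared counter. The cost is the one named above: the reported total can lag behind the real buffer size.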
Throttle/preempt flows that use the multi command squasher when the crb (CapturingReplyBuilder) crosses the limit.